PROSPER LOAN

by TONG LI

Introduction

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

In this project, I am going to explore the relationship between the loan amount (LoanOriginalAmount in the dataset) and some other variables selected from the original dataset.

Data Analysis

1. Variables selection through correlation matrix and univariate plots

1.1 General selection

Since the dataset contains many variables, before doing any data analysis I selected 10-15 variables I would like to explore. First, I excluded all variables with proportion of missing values larger than 10%.

## [1] 113937     81

1.2 Numeric variables selection

Second, in the variables with missing values fewer than 10%, I selected all the numeric variables and explore their correlation with LoanOriginalAmount. Because there are too many variables and an ordinary correlation matrix will be too big (47 by 47), I will use a flattened correlation table instead. This table contains four columns: (variable) i, (variable) j, cor, & p. Each row contains the correlation coefficient and the p value of one pair of the numeric variables.

By doing the above steps, in addition to LoanOriginalAmount, I selected 6 numeric variables whose correlation magnitude with LoanOriginalAmount were approximately between 0.3 and 0.9.

1.3 Factors selection

Next, I selected appropriate factors by plotting their distributions and excluding the ones with extreme ceiling or floor.

Four factors were selected.

Then the 7 numeric variales and 4 factors were selected from the dataset. A new variable “LoanOriginalQuarter” was created from the original variable “LoanOriginationQuarter”, with only Q1, Q2, Q3 and Q4 information in the new variable but not the specific years.

2. Univariate Analysis

2.1 Univariate Plots

## 'data.frame':    113937 obs. of  11 variables:
##  $ LoanOriginalAmount   : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment   : num  330 319 123 321 564 ...
##  $ Investors            : int  258 1 41 158 20 1 1 1 1 1 ...
##  $ LP_CustomerPayments  : num  11396 0 4187 5143 2820 ...
##  $ LP_ServiceFees       : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ Term                 : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ CreditScoreRangeLower: int  640 680 480 800 680 740 680 700 820 820 ...
##  $ IncomeRange          : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IsBorrowerHomeowner  : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ LoanOriginalQuarter  : Factor w/ 4 levels "Q1","Q2","Q3",..: 3 1 1 4 3 4 2 2 4 4 ...
##  $ EmploymentStatus     : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
##  LoanOriginalAmount MonthlyLoanPayment   Investors      
##  Min.   : 1000      Min.   :   0.0     Min.   :   1.00  
##  1st Qu.: 4000      1st Qu.: 131.6     1st Qu.:   2.00  
##  Median : 6500      Median : 217.7     Median :  44.00  
##  Mean   : 8337      Mean   : 272.5     Mean   :  80.48  
##  3rd Qu.:12000      3rd Qu.: 371.6     3rd Qu.: 115.00  
##  Max.   :35000      Max.   :2251.5     Max.   :1189.00  
##                                                         
##  LP_CustomerPayments LP_ServiceFees         Term      
##  Min.   :   -2.35    Min.   :-664.87   Min.   :12.00  
##  1st Qu.: 1005.76    1st Qu.: -73.18   1st Qu.:36.00  
##  Median : 2583.83    Median : -34.44   Median :36.00  
##  Mean   : 4183.08    Mean   : -54.73   Mean   :40.83  
##  3rd Qu.: 5548.40    3rd Qu.: -13.92   3rd Qu.:36.00  
##  Max.   :40702.39    Max.   :  32.06   Max.   :60.00  
##                                                       
##  CreditScoreRangeLower         IncomeRange    IsBorrowerHomeowner
##  Min.   :  0.0         $25,000-49,999:32192   False:56459        
##  1st Qu.:660.0         $50,000-74,999:31050   True :57478        
##  Median :680.0         $100,000+     :17337                      
##  Mean   :685.6         $75,000-99,999:16916                      
##  3rd Qu.:720.0         Not displayed : 7741                      
##  Max.   :880.0         $1-24,999     : 7274                      
##  NA's   :591           (Other)       : 1427                      
##  LoanOriginalQuarter      EmploymentStatus
##  Q1:29678            Employed     :67322  
##  Q2:24906            Full-time    :26355  
##  Q3:27967            Self-employed: 6134  
##  Q4:31386            Not available: 5347  
##                      Other        : 3806  
##                      (Other)      : 2718  
##                      NA's         : 2255

Histogram of numeric variables:

Although Term is a numeric variable, it actually only has three levels: 12, 36, and 60. Therefore I plot Term again as a factor

All those numeric variables have skewed distributions (except for Term), and they are also on very different scales. Therefore they need to be standardized and transformed. Before doing the log transformation for each variable, an appropriate integer will be added to each datum based on the range of the standardized values to avoid NAN generated by zero values. The new distributions (after being standardized and log transformed) are as follows:

##  LoanOriginalAmount MonthlyLoanPayment   Investors      
##  Min.   :-1.1747    Min.   :-1.4140    Min.   :-0.7698  
##  1st Qu.:-0.6944    1st Qu.:-0.7310    1st Qu.:-0.7601  
##  Median :-0.2941    Median :-0.2841    Median :-0.3533  
##  Mean   : 0.0000    Mean   : 0.0000    Mean   : 0.0000  
##  3rd Qu.: 0.5865    3rd Qu.: 0.5143    3rd Qu.: 0.3344  
##  Max.   : 4.2689    Max.   :10.2701    Max.   :10.7375  
##                                                         
##  LP_CustomerPayments LP_ServiceFees     CreditScoreRangeLower
##  Min.   :-0.8736     Min.   :-10.0559   Min.   :-10.3158     
##  1st Qu.:-0.6632     1st Qu.: -0.3041   1st Qu.: -0.3847     
##  Median :-0.3338     Median :  0.3343   Median : -0.0838     
##  Mean   : 0.0000     Mean   :  0.0000   Mean   :  0.0000     
##  3rd Qu.: 0.2850     3rd Qu.:  0.6725   3rd Qu.:  0.5181     
##  Max.   : 7.6226     Max.   :  1.4303   Max.   :  2.9256     
##                                         NA's   :591

Distribution of factors:

The order of the IncomeRange levels needs to be rearranged. In addition, the labels on the x axis were too long and some were overlapped. After adjusted it was plotted again:

The types of Employment Status were confusing: ‘Full-time’, ‘Part-time’ and ‘Self-employed’ should all be considered as ‘Employed’. Therefore, another variable was created in which all the three types were labeled as ‘Employed’ as well.

However, since the loans with a status of ‘Employed’ were much more than the loans with other kinds of status, the Modified Employment Status was probably not a very good variable, if I want to explore its relationship with LoanOriginalAmount.

2.2 Analysis

What is the structure of your dataset?

There are 113,937 records in my dataset, with 11 variables. Seven variables are numeric variables, including LoanOriginalAmount, MonthlyLoanPayment, Investors, LP_CustomerPayments, LP_ServiceFees, Term, and CreditScoreRangeLower. Four variables are factors; their names and levels are as follows:

IncomeRange: Not employed, $0, $1-24,999, $25,000-49,999, $50,000-74,999, $75,000-99,999, $100,000+, Not displayed IsBorrowerHomeowner: Ture, False
LoanOriginalQuarter: Q1, Q2, Q3, Q4
EmploymentStatus2: Employed, Retired, Not employed, Other, Not available, NA (created based on EmploymentStatus)

Other observations:
1. The numeric variables are correlated with LoanOriginalAmount, with the magnitude of correlation approximately between 0.3 and 0.9.
2. The distribution of most numeric variables are not normal, and are on very different scales. Therefore, standardization and log transformation were used on LoanOriginalAmount, MonthlyLoanPayment, Investors, LP_CustomerPayments LP_ServiceFees, and CreditScoreRangeLower.
3. In the modified Employment Status (EmploymentStatus2), the ‘Employed’ status takes a proportion of more than 80%, therefore it might not be a very good predictor.

What is/are the main feature(s) of interest in your dataset? What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The main feature of interest is the origination amount of the loan (LoanOriginalAmount). I’d like to see which features are best for predicting the original amount of loan. I have already selected numeric variables which are moderately correlated to LoanOriginalAmount, and factors without extreme distributions, and I want to see whether some combination of these variables can be predict the original amount of loan.

Did you create any new variables from existing variables in the dataset?

Two new variables were created:
1. LoanOriginalQuarter: created from LoanOriginateQuarter. The new variable only contains in which quarter (Q1, Q2, Q3 or Q4) the loan was made, without the year information.
2. EmploymentStatus2: created from EmploymentStatus. The original variable includes ‘Employed’, ‘Full-time’, ‘Part-time’, ‘Self-employed’ and several other levels. However this is confusing because obviously ‘Employed’ should include the latter three levels. Therefore in the new variable, the level ‘Full-time’, ‘Part-time’ and ‘Self-employed’ were all relabeled as ‘Employed’.

3. Bivariate Analysis

3.1 Bivariate Plots

From now on, I will use the new data set created from the previous step which contains transformed numeric variables and adjusted factors.

Generally explore the relationship between the variables in the dataset:

After standardization and log transformation, some correlations were different from the correlation with original variables. Next, I will used plots to show the relationship between the transformed variables:

Although the correlation between LoanOriginalAmount and MonthlyLoanPayment is high (above 0.9), this plot shows that the relationship between the two variables is probably moderated by a third variable. It looks that three regression lines are needed here to capture the linear relationship between LoanOriginalAmount and MonthlyLoanPayment.

This plot shows a weak positive correlation between the two variables.

This plot shows a weak-to-medium correlation between the two variables.

Scatter plot to show the relationship between ServiceFees and LoanOriginalAmount:

This plot shows a medium negative correlation between the two variables.

This plot shows a positive medium correlation between the two variables. It looks that the correlation was affected by extreme values.

This plot shows that the amount of loan for 12, 36 and 60 months of term was different, with more amount corresponding to longer term.

To explore the Loan Original Amount between different lengths of Term:

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## Term             2    605  302.75    8637 <2e-16 ***
## Residuals   113934   3994    0.04                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  newdata$LoanOriginalAmount and newdata$Term 
## 
##    12     36    
## 36 <2e-16 -     
## 60 <2e-16 <2e-16
## 
## P value adjustment method: bonferroni

So ANOVA test shows that Loan Amount significantly increases with Term length.

The general trend was that with higher Income, the mean and variance of loan amount was also higher. The exception was at the income of 0. This might be because people receive 0 in their income for different reasons (temparory job, internship, etc.), so for the zero-income people their loan amount may be more likely to depend on other variables.

To explore the difference of loan amount by income range:

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## IncomeRange      7    837  119.54    3620 <2e-16 ***
## Residuals   113929   3762    0.03                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  newdata$LoanOriginalAmount and newdata$IncomeRange 
## 
##                Not employed $0      $1-24,999 $25,000-49,999
## $0             1.1e-12      -       -         -             
## $1-24,999      0.0084       < 2e-16 -         -             
## $25,000-49,999 < 2e-16      0.7689  < 2e-16   -             
## $50,000-74,999 < 2e-16      < 2e-16 < 2e-16   < 2e-16       
## $75,000-99,999 < 2e-16      < 2e-16 < 2e-16   < 2e-16       
## $100,000+      < 2e-16      < 2e-16 < 2e-16   < 2e-16       
## Not displayed  1.0000       < 2e-16 8.7e-14   < 2e-16       
##                $50,000-74,999 $75,000-99,999 $100,000+
## $0             -              -              -        
## $1-24,999      -              -              -        
## $25,000-49,999 -              -              -        
## $50,000-74,999 -              -              -        
## $75,000-99,999 < 2e-16        -              -        
## $100,000+      < 2e-16        < 2e-16        -        
## Not displayed  < 2e-16        < 2e-16        < 2e-16  
## 
## P value adjustment method: bonferroni

The ANOVA test shows that from $1-24,999 above, loan amount increases with income range.

The loan amount for home owners looks larger than that for non home owners, and larger variance of loan amount was found for home owners.

## 
##  Welch Two Sample t-test
## 
## data:  newdata$LoanOriginalAmount by newdata$IsBorrowerHomeowner
## t = -71.593, df = 113170, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.08558477 -0.08102359
## sample estimates:
## mean in group False  mean in group True 
##           0.2107919           0.2940961

The independent t test confirms that overall, home owners have larger loan amount.

The loan amount in different quarter was not very different from one another.

##                         Df Sum Sq Mean Sq F value Pr(>F)    
## LoanOriginalQuarter      3     54   17.84   447.2 <2e-16 ***
## Residuals           113933   4546    0.04                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  newdata$LoanOriginalAmount and newdata$LoanOriginalQuarter 
## 
##    Q1     Q2     Q3    
## Q2 <2e-16 -      -     
## Q3 <2e-16 0.032  -     
## Q4 <2e-16 <2e-16 <2e-16
## 
## P value adjustment method: bonferroni
## # A tibble: 4 x 3
##   LoanOriginalQuarter mean_amount     n
##                <fctr>       <dbl> <int>
## 1                  Q1   0.2804641 29678
## 2                  Q2   0.2279122 24906
## 3                  Q3   0.2327582 27967
## 4                  Q4   0.2643094 31386

The AVOVA test shows difference of loan amount among quarters. To be specific, loan amount in Q1 > Q4 > Q3 > Q2.

From this plot, the loan amount of employed borrowers has larger variance and higher mean compared to that of borrowers with other employment status.

##                       Df Sum Sq Mean Sq F value Pr(>F)    
## EmploymentStatus2      4    119  29.858     762 <2e-16 ***
## Residuals         111677   4376   0.039                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2255 observations deleted due to missingness
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  newdata$LoanOriginalAmount and newdata$EmploymentStatus2 
## 
##               Employed Retired Not employed Other 
## Retired       <2e-16   -       -            -     
## Not employed  <2e-16   1.00    -            -     
## Other         <2e-16   <2e-16  <2e-16       -     
## Not available <2e-16   0.54    1.00         <2e-16
## 
## P value adjustment method: bonferroni

So employed borrowers tend to have larger amount of loan compared to borrowers with other employment status. No difference is found between retired and not employed borrowers.

In addition to the main variable (LoanOriginalAmount) that I focused on, I also looked into the relationship between other variables, and found a relationship between pre charge-off cumulative gross payments made by the borrower on the loan (LP_CustomerPayments) and Cumulative service fees paid by the investors who have invested in the loan (LP_ServiceFees).

3.2 Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Observed relationships are as follows:
1. The plots revealed a strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment, which means that people with larger amount of original loan tend to schedule larger amount of monthly loan payment.
2. A weak positive correlation was found between LoanOriginalAmount and the number of investors that funded the loan, which shows that the amount of loan tends to be larger with more investors.
3. A weak positive correlation was found between LoanOriginalAmount and pre charge-off cumulative gross payments made by the borrower (LP_CustomerPayments), which means that the amount of loan tends to be larger if the borrower has made higher pre charge-off payment.
4. A medium negative correlation was found between LoanOriginalAmount and cumulative service fees paid by the investors who have invested in the loan, which means that the amount of loan tends to be larger if the investors have paid more service fees.
5. A weak positive correlation was found between LoanOriginalAmount and the lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. It means that borrowers with higher credit score tend to have larger amount of loan.
6. Larger amount of loan tends to have longer term.
7. Borrowers with their own home tend to have larger amount of loan.
8. In which quarter the loan was created seems to affect the amount of loan.
9. In general, more income is related to larger amount of loan.
10. Borrowers who are employed tend to have larger amount of loan.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

An interesting relationship is that there is a relatively strong negative correlation between pre charge-off cumulative gross payments made by the borrower on the loan and cumulative service fees paid by the investors who have invested in the loan. Therefore, if the borrower pay more for the pre charge-off fees, the investors will pay less for the service fees.

What was the strongest relationship you found?

The strongest relationship was the correlation between LoanOriginalAmount and MonthlyLoanPayment.

4. Multivariate Analysis

4.1 Multivariate Plots

The scatter plot of MonthlyLoanPayment and LoanOriginalAmount showed that three lines were possibly needed to capture the linear relationship between the two variables. Since Term was the only variable with three levels, I first investigated whether different levels of Term would affect the relationship between the two variables.

From the plot, the correlation between LoanOriginalAmount and MonthlyLoanPayment exhibited very good linear relationship on each level of Term. Therefore there should be a interaction between Term and MonthlyLoanPayment when predicting LoanOriginalAmount.

To see whether the relationship between Investors and LoanOriginalAmount is moderated by other variables, I tried several moderators and the plots were as follows:

From the above plots, the relationship between Investors and LoanOriginalAmount might be moderated by Term and IncomeRange. Other variables did not show a strong moderating effect.

To see whether the relationship between LP_CustomerPayments and LoanOriginalAmount is moderated by other variables:

Although in some of the above plots, the regression lines on different levels of a thid variable showed different slopes, more careful examination of the patterns of the dots suggested that most of the different slopes were probably driven by some outliers, or the extreme unbalance between the number of datum in each condition, but not the difference in the patterns of the relationship. Therefore, the only variables was found to moderate the relationship between LP_CustomerPayments and LoanOriginalAmount was LoanOriginalQuarter.

To see whether the relationship between LP_ServiceFees and LoanOriginalAmount is moderated by other variables:

This situation was similar to the previous one. No variable was found to moderate the relationship between LP_ServiceFees and LoanOriginalAmount.

To see whether the relationship between CreditScoreRangeLower and LoanOriginalAmount was moderated by other variables:

The situation looked like the ones with LP_ServiceFees, no other variable was found to moderate the relationship between CreditScoreRangeLower and LoanOriginalAmount.

It seems that there is an interaction between Term and whether the borrower is a home owner.

It seems that the length of term did not affect the relationship between income range and loan amount.

Again, the length of term did not affect the relationship between loan amount and in which quarter the loan was created.

It seems that whether the borrower is a home owner did not affect the relationship between the income range and the amount of loan.

It seems that in which quarter the loan was created did not affect the relationship between borrowers’ income range and the amount of loan.

4.2 Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there any interesting or surprising interactions between features?

The observed moderating effects/interactions are as follows:
1. The relationship between loan amount and monthly payment was moderated by the term of the loan. With longer loan term, larger amount of loan tend to relate even more monthly payment.
2. The relationship between loan amount and number of investors was moderated by term length and borrowers’ income. In general, more investors is related to larger amount of loan. With longer term, more investors is still related to larger amount of loan, but not as large as that with shorter term. On the other hand, with higher income, more investors is also related to larger amount of loan but the amount is not that large as for borrowers with less income.
3. An interaction was found between cumulative service fees paid by the investors who had invested in the loan and in which quarter the loan was created.
4. An interaction was found between loan term and whether the borrower was a home owner.

Model with my dataset:

Based on the univariate, bivariate and mutivariate analysis, I’m going to create a multiple regression model to predict original loan amount, including the variables that are correlated to LoanOriginalAmount, and the interactions found from the plots. I will add the variables to the model one after another, and then add interactions, meanwhile comparing the models using ANOVA. One noteworthy thing is that although EmploymentStatus2 seems to affect loan amount, more than 88% of the employment status was employed, therefore the unbalance between different employment status makes it not a good predictor. Therefore EmploymentStatus2 will not be added in the model.

there is an apparant interaction between IsBorrowerHomeowner and Term, I will not include this in the model, for the variable IsBorrowerHomeowner itself seems not to be a good predictor based on other plots, therefore the interaction should not be included. The situation with Quarter is similar: although an interaction between LP_ServiceFees and LoanOriginalQuarter was found, based on the bivariate plot, Quarter does not affect loan amount, therefore quarter and its interaction will not be included in the model.

## Analysis of Variance Table
## 
## Model  1: LoanOriginalAmount ~ MonthlyLoanPayment
## Model  2: LoanOriginalAmount ~ MonthlyLoanPayment + Investors
## Model  3: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments
## Model  4: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees
## Model  5: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term
## Model  6: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower
## Model  7: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange
## Model  8: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     IsBorrowerHomeowner
## Model  9: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     IsBorrowerHomeowner + LoanOriginalQuarter
## Model 10: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term
## Model 11: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term + 
##     Investors:IncomeRange
## Model 12: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term + 
##     Investors:IncomeRange + Investors:Term
## Model 13: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     IsBorrowerHomeowner + LoanOriginalQuarter + MonthlyLoanPayment:Term + 
##     Investors:IncomeRange + Investors:Term + Term:IsBorrowerHomeowner
##    Res.Df    RSS Df Sum of Sq          F    Pr(>F)    
## 1  111678 466.18                                      
## 2  111677 466.13  1     0.048    32.9398 9.530e-09 ***
## 3  111676 460.89  1     5.239  3573.5462 < 2.2e-16 ***
## 4  111675 423.39  1    37.501 25580.6246 < 2.2e-16 ***
## 5  111673 227.10  2   196.287 66947.4375 < 2.2e-16 ***
## 6  111672 221.40  1     5.701  3888.8565 < 2.2e-16 ***
## 7  111665 213.76  7     7.646   745.1363 < 2.2e-16 ***
## 8  111664 213.67  1     0.087    59.3399 1.337e-14 ***
## 9  111661 212.73  3     0.941   214.0580 < 2.2e-16 ***
## 10 111659 169.40  2    43.323 14776.1865 < 2.2e-16 ***
## 11 111652 169.34  7     0.061     5.9451 6.184e-07 ***
## 12 111650 163.77  2     5.577  1902.2993 < 2.2e-16 ***
## 13 111648 163.67  2     0.092    31.5033 2.100e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Calls:
## m1: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment, data = newdata_comp)
## m2: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors, 
##     data = newdata_comp)
## m3: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments, data = newdata_comp)
## m4: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees, data = newdata_comp)
## 
## ===========================================================================
##                            m1           m2           m3           m4       
## ---------------------------------------------------------------------------
##   (Intercept)            0.016***     0.016***     0.012***     0.616***   
##                         (0.000)      (0.000)      (0.000)      (0.006)     
##   MonthlyLoanPayment     0.938***     0.938***     0.946***     0.910***   
##                         (0.001)      (0.001)      (0.001)      (0.001)     
##   Investors                           0.002***     0.013***     0.006***   
##                                      (0.001)      (0.001)      (0.001)     
##   LP_CustomerPayments                             -0.021***    -0.056***   
##                                                   (0.001)      (0.001)     
##   LP_ServiceFees                                               -0.579***   
##                                                                (0.006)     
## ---------------------------------------------------------------------------
##   R-squared                  0.896        0.896        0.897        0.906  
##   adj. R-squared             0.896        0.896        0.897        0.906  
##   sigma                      0.065        0.065        0.064        0.062  
##   F                     965265.838   482684.380   325867.434   268518.131  
##   p                          0.000        0.000        0.000        0.000  
##   Log-likelihood        147470.456   147476.241   148107.369   152846.341  
##   Deviance                 466.179      466.131      460.892      423.391  
##   AIC                  -294934.913  -294944.482  -296204.738  -305680.683  
##   BIC                  -294906.043  -294905.988  -296156.621  -305622.943  
##   N                     111680       111680       111680       111680      
## ===========================================================================
## 
## Calls:
## m4: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees, data = newdata_comp)
## m5: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term, data = newdata_comp)
## m6: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower, 
##     data = newdata_comp)
## m7: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange, data = newdata_comp)
## 
## ================================================================================================
##                                                 m4           m5           m6           m7       
## ------------------------------------------------------------------------------------------------
##   (Intercept)                                 0.616***     0.047***    -0.103***    -0.129***   
##                                              (0.006)      (0.005)      (0.006)      (0.006)     
##   MonthlyLoanPayment                          0.910***     0.895***     0.887***     0.874***   
##                                              (0.001)      (0.001)      (0.001)      (0.001)     
##   Investors                                   0.006***     0.019***     0.017***     0.017***   
##                                              (0.001)      (0.000)      (0.000)      (0.000)     
##   LP_CustomerPayments                        -0.056***    -0.021***    -0.021***    -0.023***   
##                                              (0.001)      (0.001)      (0.000)      (0.000)     
##   LP_ServiceFees                             -0.579***    -0.200***    -0.201***    -0.210***   
##                                              (0.006)      (0.004)      (0.004)      (0.004)     
##   Term: 36/12                                              0.170***     0.172***     0.171***   
##                                                           (0.001)      (0.001)      (0.001)     
##   Term: 60/12                                              0.262***     0.262***     0.262***   
##                                                           (0.001)      (0.001)      (0.001)     
##   CreditScoreRangeLower                                                 0.144***     0.161***   
##                                                                        (0.003)      (0.003)     
##   IncomeRange: $0/Not employed                                                       0.025***   
##                                                                                     (0.002)     
##   IncomeRange: $1-24,999/Not employed                                                0.010***   
##                                                                                     (0.002)     
##   IncomeRange: $25,000-49,999/Not employed                                           0.014***   
##                                                                                     (0.002)     
##   IncomeRange: $50,000-74,999/Not employed                                           0.023***   
##                                                                                     (0.002)     
##   IncomeRange: $75,000-99,999/Not employed                                           0.028***   
##                                                                                     (0.002)     
##   IncomeRange: $100,000+/Not employed                                                0.034***   
##                                                                                     (0.002)     
##   IncomeRange: Not displayed/Not employed                                            0.040***   
##                                                                                     (0.002)     
## ------------------------------------------------------------------------------------------------
##   R-squared                                       0.906        0.949        0.951        0.952  
##   adj. R-squared                                  0.906        0.949        0.951        0.952  
##   sigma                                           0.062        0.045        0.045        0.044  
##   F                                          268518.131   349814.089   307969.395   159768.359  
##   p                                               0.000        0.000        0.000        0.000  
##   Log-likelihood                             152846.341   187628.478   189048.125   191010.735  
##   Deviance                                      423.391      227.104      221.403      213.756  
##   AIC                                       -305680.683  -375240.957  -378078.250  -381989.471  
##   BIC                                       -305622.943  -375163.970  -377991.639  -381835.497  
##   N                                          111680       111680       111680       111680      
## ================================================================================================
## 
## Calls:
## m7: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange, data = newdata_comp)
## m8: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + IsBorrowerHomeowner, data = newdata_comp)
## m9: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter, 
##     data = newdata_comp)
## m10: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter + 
##     MonthlyLoanPayment:Term, data = newdata_comp)
## 
## ================================================================================================
##                                                 m7           m8           m9           m10      
## ------------------------------------------------------------------------------------------------
##   (Intercept)                                -0.129***    -0.126***    -0.118***    -0.034***   
##                                              (0.006)      (0.006)      (0.006)      (0.005)     
##   MonthlyLoanPayment                          0.874***     0.874***     0.871***     0.382***   
##                                              (0.001)      (0.001)      (0.001)      (0.004)     
##   Investors                                   0.017***     0.017***     0.016***     0.018***   
##                                              (0.000)      (0.000)      (0.000)      (0.000)     
##   LP_CustomerPayments                        -0.023***    -0.023***    -0.020***    -0.017***   
##                                              (0.000)      (0.000)      (0.001)      (0.000)     
##   LP_ServiceFees                             -0.210***    -0.210***    -0.211***    -0.137***   
##                                              (0.004)      (0.004)      (0.004)      (0.004)     
##   Term: 36/12                                 0.171***     0.171***     0.172***     0.015***   
##                                              (0.001)      (0.001)      (0.001)      (0.002)     
##   Term: 60/12                                 0.262***     0.262***     0.263***     0.056***   
##                                              (0.001)      (0.001)      (0.001)      (0.002)     
##   CreditScoreRangeLower                       0.161***     0.158***     0.156***     0.154***   
##                                              (0.003)      (0.003)      (0.003)      (0.003)     
##   IncomeRange: $0/Not employed                0.025***     0.025***     0.025***     0.024***   
##                                              (0.002)      (0.002)      (0.002)      (0.002)     
##   IncomeRange: $1-24,999/Not employed         0.010***     0.010***     0.010***     0.010***   
##                                              (0.002)      (0.002)      (0.002)      (0.001)     
##   IncomeRange: $25,000-49,999/Not employed    0.014***     0.014***     0.013***     0.014***   
##                                              (0.002)      (0.002)      (0.002)      (0.001)     
##   IncomeRange: $50,000-74,999/Not employed    0.023***     0.022***     0.022***     0.020***   
##                                              (0.002)      (0.002)      (0.002)      (0.001)     
##   IncomeRange: $75,000-99,999/Not employed    0.028***     0.027***     0.027***     0.024***   
##                                              (0.002)      (0.002)      (0.002)      (0.001)     
##   IncomeRange: $100,000+/Not employed         0.034***     0.033***     0.033***     0.030***   
##                                              (0.002)      (0.002)      (0.002)      (0.001)     
##   IncomeRange: Not displayed/Not employed     0.040***     0.039***     0.037***     0.035***   
##                                              (0.002)      (0.002)      (0.002)      (0.001)     
##   IsBorrowerHomeowner: True/False                          0.002***     0.002***     0.001***   
##                                                           (0.000)      (0.000)      (0.000)     
##   LoanOriginalQuarter: Q2/Q1                                           -0.008***    -0.007***   
##                                                                        (0.000)      (0.000)     
##   LoanOriginalQuarter: Q3/Q1                                           -0.007***    -0.007***   
##                                                                        (0.000)      (0.000)     
##   LoanOriginalQuarter: Q4/Q1                                           -0.003***    -0.004***   
##                                                                        (0.000)      (0.000)     
##   MonthlyLoanPayment x Term: 36/12                                                   0.488***   
##                                                                                     (0.004)     
##   MonthlyLoanPayment x Term: 60/12                                                   0.650***   
##                                                                                     (0.004)     
## ------------------------------------------------------------------------------------------------
##   R-squared                                       0.952        0.952        0.953        0.962  
##   adj. R-squared                                  0.952        0.952        0.953        0.962  
##   sigma                                           0.044        0.044        0.044        0.039  
##   F                                          159768.359   149179.540   124890.536   142572.058  
##   p                                               0.000        0.000        0.000        0.000  
##   Log-likelihood                             191010.735   191033.465   191280.036   203996.114  
##   Deviance                                      213.756      213.669      212.728      169.405  
##   AIC                                       -381989.471  -382032.930  -382520.073  -407948.228  
##   BIC                                       -381835.497  -381869.332  -382327.605  -407736.513  
##   N                                          111680       111680       111680       111680      
## ================================================================================================
## 
## Calls:
## m10: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter + 
##     MonthlyLoanPayment:Term, data = newdata_comp)
## m11: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter + 
##     MonthlyLoanPayment:Term + Investors:IncomeRange, data = newdata_comp)
## m12: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter + 
##     MonthlyLoanPayment:Term + Investors:IncomeRange + Investors:Term, 
##     data = newdata_comp)
## m13: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + IsBorrowerHomeowner + LoanOriginalQuarter + 
##     MonthlyLoanPayment:Term + Investors:IncomeRange + Investors:Term + 
##     Term:IsBorrowerHomeowner, data = newdata_comp)
## 
## ============================================================================================================
##                                                             m10          m11          m12          m13      
## ------------------------------------------------------------------------------------------------------------
##   (Intercept)                                            -0.034***    -0.033***     0.054***     0.054***   
##                                                          (0.005)      (0.005)      (0.005)      (0.006)     
##   MonthlyLoanPayment                                      0.382***     0.382***     0.281***     0.281***   
##                                                          (0.004)      (0.004)      (0.004)      (0.004)     
##   Investors                                               0.018***     0.026***     0.260***     0.260***   
##                                                          (0.000)      (0.005)      (0.006)      (0.006)     
##   LP_CustomerPayments                                    -0.017***    -0.017***    -0.018***    -0.018***   
##                                                          (0.000)      (0.000)      (0.000)      (0.000)     
##   LP_ServiceFees                                         -0.137***    -0.137***    -0.153***    -0.153***   
##                                                          (0.004)      (0.004)      (0.004)      (0.004)     
##   Term: 36/12                                             0.015***     0.015***    -0.053***    -0.053***   
##                                                          (0.002)      (0.002)      (0.002)      (0.002)     
##   Term: 60/12                                             0.056***     0.056***    -0.019***    -0.022***   
##                                                          (0.002)      (0.002)      (0.002)      (0.002)     
##   CreditScoreRangeLower                                   0.154***     0.155***     0.155***     0.155***   
##                                                          (0.003)      (0.003)      (0.002)      (0.002)     
##   IncomeRange: $0/Not employed                            0.024***     0.022***     0.021***     0.021***   
##                                                          (0.002)      (0.002)      (0.002)      (0.002)     
##   IncomeRange: $1-24,999/Not employed                     0.010***     0.010***     0.008***     0.008***   
##                                                          (0.001)      (0.002)      (0.002)      (0.002)     
##   IncomeRange: $25,000-49,999/Not employed                0.014***     0.013***     0.012***     0.011***   
##                                                          (0.001)      (0.002)      (0.002)      (0.002)     
##   IncomeRange: $50,000-74,999/Not employed                0.020***     0.018***     0.017***     0.017***   
##                                                          (0.001)      (0.002)      (0.002)      (0.002)     
##   IncomeRange: $75,000-99,999/Not employed                0.024***     0.022***     0.021***     0.021***   
##                                                          (0.001)      (0.002)      (0.002)      (0.002)     
##   IncomeRange: $100,000+/Not employed                     0.030***     0.028***     0.027***     0.027***   
##                                                          (0.001)      (0.002)      (0.002)      (0.002)     
##   IncomeRange: Not displayed/Not employed                 0.035***     0.032***     0.031***     0.031***   
##                                                          (0.001)      (0.002)      (0.002)      (0.002)     
##   IsBorrowerHomeowner: True/False                         0.001***     0.001***     0.001***     0.000      
##                                                          (0.000)      (0.000)      (0.000)      (0.002)     
##   LoanOriginalQuarter: Q2/Q1                             -0.007***    -0.007***    -0.007***    -0.007***   
##                                                          (0.000)      (0.000)      (0.000)      (0.000)     
##   LoanOriginalQuarter: Q3/Q1                             -0.007***    -0.007***    -0.007***    -0.007***   
##                                                          (0.000)      (0.000)      (0.000)      (0.000)     
##   LoanOriginalQuarter: Q4/Q1                             -0.004***    -0.004***    -0.004***    -0.004***   
##                                                          (0.000)      (0.000)      (0.000)      (0.000)     
##   MonthlyLoanPayment x Term: 36/12                        0.488***     0.488***     0.587***     0.587***   
##                                                          (0.004)      (0.004)      (0.004)      (0.004)     
##   MonthlyLoanPayment x Term: 60/12                        0.650***     0.650***     0.757***     0.755***   
##                                                          (0.004)      (0.004)      (0.004)      (0.004)     
##   Investors x IncomeRange: $0/Not employed                            -0.005       -0.005       -0.006      
##                                                                       (0.007)      (0.007)      (0.007)     
##   Investors x IncomeRange: $1-24,999/Not employed                     -0.003       -0.001       -0.002      
##                                                                       (0.005)      (0.005)      (0.005)     
##   Investors x IncomeRange: $25,000-49,999/Not employed                -0.007       -0.003       -0.004      
##                                                                       (0.005)      (0.005)      (0.005)     
##   Investors x IncomeRange: $50,000-74,999/Not employed                -0.010*      -0.006       -0.006      
##                                                                       (0.005)      (0.005)      (0.005)     
##   Investors x IncomeRange: $75,000-99,999/Not employed                -0.008       -0.005       -0.005      
##                                                                       (0.005)      (0.005)      (0.005)     
##   Investors x IncomeRange: $100,000+/Not employed                     -0.008       -0.004       -0.005      
##                                                                       (0.005)      (0.005)      (0.005)     
##   Investors x IncomeRange: Not displayed/Not employed                 -0.013*      -0.012*      -0.012*     
##                                                                       (0.005)      (0.005)      (0.005)     
##   Investors x Term: 36/12                                                          -0.233***    -0.233***   
##                                                                                    (0.004)      (0.004)     
##   Investors x Term: 60/12                                                          -0.253***    -0.253***   
##                                                                                    (0.004)      (0.004)     
##   Term: 36/12 x IsBorrowerHomeowner: True/False                                                  0.000      
##                                                                                                 (0.002)     
##   Term: 60/12 x IsBorrowerHomeowner: True/False                                                  0.005*     
##                                                                                                 (0.002)     
## ------------------------------------------------------------------------------------------------------------
##   R-squared                                                   0.962        0.962        0.964        0.964  
##   adj. R-squared                                              0.962        0.962        0.964        0.964  
##   sigma                                                       0.039        0.039        0.038        0.038  
##   F                                                      142572.058   105641.845   101835.267    95319.336  
##   p                                                           0.000        0.000        0.000        0.000  
##   Log-likelihood                                         203996.114   204016.227   205886.331   205917.834  
##   Deviance                                                  169.405      169.344      163.766      163.674  
##   AIC                                                   -407948.228  -407974.454  -411710.661  -411769.668  
##   BIC                                                   -407736.513  -407695.376  -411412.336  -411452.096  
##   N                                                      111680       111680       111680       111680      
## ============================================================================================================

The ANOVA test shows that all variables and interactions in the last model (m11) significantly improve the model, therefore all of them should be kept in the model. From the above table of regression models, we can see that MonthlyLoanPayment alone can explain 89.6% of the variance of LoanOriginalAmount. This is consistent with our plots, indicating that MonthlyLoanPayment is the strongest predictor. The second strong predictor is Term, which means that longer loan term is related to larger amount of loan. The final model explains 96.4% of the variance of loan amount.

Final Plots and Summary

Plot One

Description One

LoanOriginalAmount is the predicted variable in this dataset. In this Plot One, this variable has been standardized and log transformed, because the original variable was right skewed, and was not on the same scale with all the other numeric variables in the dataset.

Plot Two

Description Two

This Plot two describes the strongest correlation related to the predicted variable (r > 0.9). Also, according to the plot, the relationship may be better captured by multiple regression lines, indicating that a third variable may moderate the relationship between monthly loan payment and loan amount.

Plot Three

Description Three

This Plot Three confirms my speculation from Plot Two. The relationship between Monthly Payment and Loan Amount was moderated by Loan Term, which had three levels: 12 months, 36 months and 60 months. According to Plot Three, with longer loan term, the correlation between monthly payment and loan amount tends to be stronger. In addition, the color of points indicates that with higher income, the loan amount tends to be larger.

Reflection

The series of analysis have shown that the original amount of loan is related to many other variables. This dataset contains six numeric variables in addition to the amount of loan, as well as four factors. After examining the relationship between loan amount and these other variables with plots and multiple linear regression, I found that all these variables predicted the amount of loan.

A strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment was found, which means that people with larger amount of original loan are more likely to schedule larger amount of monthly loan payment. A further analysis showed that this relationship was moderated by the length of loan term. With longer loan term, larger amount of loan tends to relate to even more monthly payment.

The amount of loan also tends to be larger with more investors, and this relationship was found to be moderated by both loan term and borrow’s income. With longer loan term, the correlation between number of investors and loan amount was not as strong as with shorter term; Similarly, with higher borrower’s income, the correlation was not as strong as with less income.

In general, higher income the borrower had, the larger amount of loan they tend to borrow. The employed borrowers and borrowers with higher credit score also tend to have larger amount of loan compared to borrowers with other employment status. Loan with longer term is also more likely has larger amount.

Home owners tend to have larger amount of loan, but it is also affected by the length of loan term. To be specific, whether borrowers have their own home does not matter much for a short term loan (12 months), but it affects longer term loan (36 and 60 months).

The amount of loan also tends to be larger if the borrower has made higher pre charge-off payment, or if the investors have paid more service fees.

The time when the loan was created in a year seems to affect the amount of loan, Q1 > Q4 > Q3 > Q2 although the difference between Q3 and Q2 was significant but very small.

Although the model has shown that using these variables and the interactions between several of them could predict a high proportion of the variance of loan amount, it still has limitations. Monthly payment explains more than 89% of the variance of loan amount and seems to be a very good predictor; However, there might be some variables that could predict both loan amount and monthly payment. For example, one could think that the borrower’s average expense per month, their marital status, how many children they have, and their estates, stock and other properties should also predict both their monthly payment and loan amount. These variables are not include in the original dataset, for the original dataset does not focus on more personal information of the borrowers. I think with more information considered, the model can be much improved.